Skip to content

Defer MetricKey construction to the aggregator thread#11381

Open
dougqh wants to merge 11 commits into
masterfrom
dougqh/conflating-metrics-background-work
Open

Defer MetricKey construction to the aggregator thread#11381
dougqh wants to merge 11 commits into
masterfrom
dougqh/conflating-metrics-background-work

Conversation

@dougqh
Copy link
Copy Markdown
Contributor

@dougqh dougqh commented May 15, 2026

What Does This Do

Moves the per-span MetricKey construction, cache lookups, and aggregation off the producer thread into the existing aggregator thread, replacing the Batch-based conflation pipeline with a thin per-span SpanSnapshot posted to the inbox.

Motivation

Incremental step towards using a lighter weight structure for metrics.

In the subsequent PR, I intend to switch to a simplified hash table that isn't thread-safe.

The simplified hashtable uses custom entries that that will allow us to avoid the MetricKey construction on look-up,
but given that the simple hashtable isn't thread-safe we need to move the work to the consumer thread first.

Additional Notes

Stacked on top of #11380 -- review that first; the merge base of this PR is dougqh/conflating-metrics-producer-wins, not master. The diff shown here is only the work that's new beyond that PR.

What the producer does now (per span)

  • filter (shouldComputeMetric, resource-ignored, longRunning)
  • collect tag values into a SpanSnapshot (one allocation per span)
  • inbox.offer(snapshot) + return error flag for forceKeep

What moved off the producer

  • MetricKey construction and its hash computation
  • SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name)
  • SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind)
  • PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding)
  • pending / keys ConcurrentHashMap operations
  • Batch pooling, atomic ops, contributeTo

Removed entirely

  • Batch.java -- the aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the conflation point now
  • pending ConcurrentHashMap<MetricKey, Batch>
  • keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup)
  • batchPool MessagePassingQueue<Batch>
  • CommonKeyCleaner's keys.keySet() tracking; AggregateExpiry now just reports LRU drops to health metrics

Added

  • SpanSnapshot: immutable value carrying the raw MetricKey inputs + a tagAndDuration long (duration OR-ed with ERROR_TAG / TOP_LEVEL_TAG).
  • AggregateMetric.recordOneDuration(long) -- single-hit equivalent of the existing recordDurations(int, AtomicLongArray).
  • Peer-tag values flow through the snapshot as a flattened String[] of [name0, value0, name1, value1, ...]; the aggregator encodes them through PEER_TAGS_CACHE on its own thread.
  • HealthMetrics.onStatsInboxFull() + a TracerHealthMetrics counter reported as stats.dropped_aggregates{reason:inbox_full} -- parallel to the existing reason:lru_eviction. Without conflation the producer can lose snapshots when the bounded MPSC queue is full; this makes that visible without silencing it.

Benchmark results (2 forks × 5 iter × 15s)

ConflatingMetricsAggregatorDDSpanBenchmark:

avgt (µs/op) CI (99.9%)
prior commit (stacked base) 6.343 ± 0.115 [6.228, 6.458]
this PR 2.506 ± 0.044 [2.462, 2.550]

~60% faster on the production DDSpan path. The SimpleSpan bench shows ~53% faster as well.

Caveat on the bench numbers

Without conflation, the producer pushes 1 inbox item per span instead of ~1 per 64. At the JMH bench's synthetic rate (effectively ~20M snapshots/sec from the producer) the consumer can't keep up and inbox.offer silently drops -- the new onStatsInboxFull counter would fire constantly. The headline numbers measure producer publish() latency only; consumer throughput at realistic span rates is a follow-up to validate. Tuning maxPending matters more in this design.

Real fixes for capacity (out of scope for this PR):

  • Bump maxPending default; the conflating design used 2048 slots × ~64 conflation = ~131K effective capacity, the new design has 2048 slots flat.
  • Producer-side mini-batching (TLAB-style accumulator per thread) to recover some compression.

Test plan

  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.common.metrics.*' passes
  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.core.monitor.*' passes
  • ./gradlew :dd-trace-core:compileJava :dd-trace-core:compileTestGroovy :dd-trace-core:compileJmhJava :dd-trace-core:compileTraceAgentTestGroovy all green
  • ./gradlew spotlessCheck clean
  • CI muzzle / integration suites
  • Validate stats.dropped_aggregates{reason:inbox_full} reports as expected under a synthetic high-load run (not in the JMH bench)

🤖 Generated with Claude Code

dougqh and others added 7 commits May 15, 2026 12:06
ConflatingMetricsAggregator.publish does a handful of redundant operations on
every span. None individually is large; together they show as ~2.5% on the
existing JMH benchmark once the benchmark actually exercises span.kind.

- dedup span.isTopLevel(): publish() reads it into a local, then shouldComputeMetric
  read it again. Pass the cached value in.
- resolve spanKind to String once: master called toString() twice per span (once
  inside spanKindEligible, once at the getPeerTags call site) and used HashSet
  contains on a CharSequence (which routes through equals on String). Normalize
  to String up front and reuse.
- lazy-allocate the peer-tag list: getPeerTags() always allocated an ArrayList
  sized to features.peerTags() even when the span had none of those tags set.
  Defer allocation until the first match; return Collections.emptyList() when
  none hit. MetricKey already treats null/empty peerTags as emptyList, so no
  behavior change.

Drop the spanKindEligible helper — the HashSet.contains call inlines fine in
shouldComputeMetric.

Update the JMH benchmark to set span.kind=client on every span. Without it the
filter path short-circuits before the peer-tag and toString work, so the wins
above aren't measurable. With it:

  baseline   6.755 us/op (CI [6.560, 6.950], stdev 0.129)
  optimized  6.585 us/op (CI [6.536, 6.634], stdev 0.033)

2 forks x 5 iterations x 15s. ~2.5% mean improvement and much tighter variance
fork-to-fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce SpanKindFilter -- a tiny builder-built immutable filter whose state
is an int bitmask indexed by the span.kind ordinals already cached on
DDSpanContext. Each include* on the builder sets one bit (1 << ordinal); the
runtime check is a single AND against (1 << span's ordinal).

CoreSpan.isKind(SpanKindFilter) is the new entry point. DDSpan overrides it
to do the bit-test directly against the cached ordinal -- no virtual call,
no tag-map lookup. The two existing test-only CoreSpan impls (SimpleSpan
and TraceGenerator.PojoSpan, the latter in two source sets) implement isKind
by reading the span.kind tag and delegating to SpanKindFilter.matches(String),
which converts via DDSpanContext.spanKindOrdinalOf and does the same AND.

Refactor: DDSpanContext.setSpanKindOrdinal(String) now delegates to a new
package-private static spanKindOrdinalOf(String) so the same string-to-ordinal
mapping serves both the tag interceptor path and SpanKindFilter.matches.

This is groundwork -- nothing in the codebase calls isKind yet. The next
commit will replace the HashSet-based eligibility checks in
ConflatingMetricsAggregator with SpanKindFilter instances.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the two ELIGIBLE_SPAN_KINDS_FOR_* HashSet<String> constants and the
SPAN_KIND_INTERNAL.equals check with three SpanKindFilter instances:
METRICS_ELIGIBLE_KINDS, PEER_AGGREGATION_KINDS, INTERNAL_KIND. Eligibility
checks now go through span.isKind(filter), which on DDSpan is a volatile
byte read against the already-cached span.kind ordinal plus a single bit-test.

Also defer the span.kind tag read: previously read at the top of the publish
loop and threaded through both shouldComputeMetric and the inner publish.
isKind no longer needs the string, so the read can move down into the inner
publish where it's still needed for the SPAN_KINDS cache key / MetricKey.

Supporting changes:

- DDSpanContext.spanKindOrdinalOf(String) is now public so non-DDSpan CoreSpan
  impls can compute the ordinal at tag-write time.
- SpanKindFilter gains a public matches(byte) fast-path overload that callers
  with a pre-computed ordinal use directly.
- SimpleSpan caches the ordinal in setTag(SPAN_KIND, ...), mirroring what
  TagInterceptor does for DDSpanContext, and its isKind now hits the byte
  fast path. Without this, the JMH benchmark (which uses SimpleSpan) would
  re-derive the ordinal on every isKind call and overstate the cost.

Benchmark on the bench updated last commit (kind=client on every span,
4 forks x 5 iter x 15s):

  prior commit  6.585 ± 0.049 us/op
  this commit   6.903 ± 0.096 us/op

The slight regression is a SimpleSpan-via-groovy-dispatch artifact -- the
interface call to isKind through CoreSpan, then through SimpleSpan, then
through SpanKindFilter.matches, doesn't fold as aggressively as a HashSet
contains on a static field. In production DDSpan.isKind inlines to a context
field read + ordinal byte read + bit-test, so the production path is faster
than the prior HashSet approach. A DDSpan-based benchmark would show this;
the existing SimpleSpan-based one doesn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing ConflatingMetricsAggregatorBenchmark uses SimpleSpan, a groovy
mock. That's enough for measuring queue/CHM/MetricKey work, but it conceals
the production cost of CoreSpan.isKind: SimpleSpan's isKind goes through
groovy interface dispatch into SpanKindFilter.matches, while DDSpan.isKind
inlines to a context byte-read + bit-test.

This new benchmark uses real DDSpan instances created through a CoreTracer
(with a NoopWriter so finishing doesn't reach the agent). Same shape as the
SimpleSpan bench (64-span trace, span.kind=client, peer.hostname set).

Numbers (2 forks x 5 iter x 15s):

  master:        6.428 +- 0.189 us/op  (HashSet eligibility checks)
  this branch:   6.343 +- 0.115 us/op  (SpanKindFilter bitmask)

About 1.3% faster on the production path. The SimpleSpan benchmark in the
same conditions shows a ~2.2% slowdown -- the mock's dispatch shape gives a
misleading signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make SpanKindFilter.kindMask and its constructor private now that DDSpan.isKind
no longer needs direct field access -- it delegates to SpanKindFilter.matches(byte).

The Builder.build() in the same outer class still constructs instances via the
private constructor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the producer-side conflation pipeline with a thin per-span SpanSnapshot
posted to the existing aggregator thread. The aggregator now builds the
MetricKey, does the SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups, and
updates the AggregateMetric directly -- all off the producer's hot path.

What the producer does now, per span:

  - filter (shouldComputeMetric, resource-ignored, longRunning)
  - collect tag values into a SpanSnapshot (1 allocation per span)
  - inbox.offer(snapshot) + return error flag for forceKeep

What moved off the producer:

  - MetricKey construction and its hash computation
  - SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name)
  - SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind)
  - PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding)
  - pending/keys ConcurrentHashMap operations
  - Batch pooling, batch atomic ops, batch contributeTo

Removed entirely:

  - Batch.java -- the conflation primitive is no longer needed; the
    aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the
    conflation point now.
  - pending ConcurrentHashMap<MetricKey, Batch>
  - keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup)
  - batchPool MessagePassingQueue<Batch>
  - The CommonKeyCleaner role of tracking keys.keySet() on LRU eviction --
    AggregateExpiry now just reports drops to healthMetrics.

Added:

  - SpanSnapshot: immutable value carrying the raw MetricKey inputs + a
    tagAndDuration long (duration | ERROR_TAG | TOP_LEVEL_TAG).
  - AggregateMetric.recordOneDuration(long tagAndDuration) -- the single-hit
    equivalent of the existing recordDurations(int, AtomicLongArray).
  - Peer-tag values flow through the snapshot as a flattened String[] of
    [name0, value0, name1, value1, ...]; the aggregator encodes them through
    PEER_TAGS_CACHE on its own thread.

Benchmark results (2 forks x 5 iter x 15s):

  ConflatingMetricsAggregatorDDSpanBenchmark
    prior commit  6.343 +- 0.115 us/op
    this commit   2.506 +- 0.044 us/op  (~60% faster)

  ConflatingMetricsAggregatorBenchmark (SimpleSpan)
    prior commit  6.585 +- 0.049 us/op
    this commit   3.116 +- 0.032 us/op  (~53% faster)

Caveat on the benchmark: without conflation, the producer pushes 1 inbox
item per span instead of ~1 per 64. At the benchmark's synthetic rate the
consumer can't keep up and inbox.offer silently drops. The numbers measure
producer publish() latency only; consumer throughput at realistic span rates
is a follow-up to validate. Tuning maxPending matters more in this design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the per-span SpanSnapshot inbox path, the producer can lose snapshots
when the bounded MPSC queue is full -- silently, since inbox.offer() returns
a boolean we previously ignored. The conflating-Batch design used to absorb
~64x more producer pressure per inbox slot, so this is a new failure mode
worth surfacing.

Wire it through the existing HealthMetrics path:

- HealthMetrics.onStatsInboxFull() (no-op default).
- TracerHealthMetrics gets a statsInboxFull LongAdder and a new reason tag
  reason:inbox_full reported under the same stats.dropped_aggregates metric
  used for LRU evictions. Two LongAdders, two tagged time series.
- ConflatingMetricsAggregator.publish increments the counter when
  inbox.offer(snapshot) returns false.

This doesn't fix the drop -- tuning maxPending and/or building producer-side
batching are the actual fixes. But it makes the failure visible in the same
place ops already watches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM labels May 15, 2026
@dougqh dougqh marked this pull request as ready for review May 18, 2026 15:37
@dougqh dougqh requested a review from a team as a code owner May 18, 2026 15:37
@dougqh dougqh requested a review from mhlidd May 18, 2026 15:37
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 950499c767

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +514 to +518
reportIfChanged(
target.statsd,
"stats.dropped_aggregates",
target.statsInboxFull,
REASON_INBOX_FULL_TAG);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Resize health metric history for inbox-full counter

When statsInboxFull is nonzero this added 52nd reportIfChanged call indexes previousCounts[++countIndex], but previousCounts is still sized for the previous 51 counters. As a result the new reason:inbox_full metric is never emitted and every flush that reaches this call logs the resize warning instead; increase the array size alongside the new counter.

Useful? React with 👍 / 👎.

The new reason:inbox_full reportIfChanged call advances countIndex to 51,
but previousCounts was still sized for 51 counters (max index 50), so the
metric never emitted and the resize warning fired every flush. Bump the
array to 52 and add a regression test that exercises the flush path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-official

This comment has been minimized.

Base automatically changed from dougqh/conflating-metrics-producer-wins to master May 19, 2026 23:04
publish() previously did all of the tag extraction (peer-tag pairs,
HTTP method/endpoint, span kind, gRPC status) and the SpanSnapshot
allocation before calling inbox.offer; on a full inbox the offer
failed and everything became garbage.

Early-out with an approximate size() vs capacity() check up front. The
jctools MPSC queue's size() is best-effort but that's fine: under-
estimation falls through to the existing offer-as-source-of-truth
path, over-estimation drops a snapshot that would have fit (and
onStatsInboxFull was about to fire on the next span anyway).

error is computed first so the force-keep return is correct whether
or not the snapshot is built.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp: core Tracer core comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM tag: no release notes Changes to exclude from release notes tag: performance Performance related changes type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants